Dynamic-content web crawling through traffic monitoring

ABSTRACT

A dynamic-content web crawler is disclosed. These New Crawlers (NCs) are located at points between the server and the user and monitor content from those points, for example by proxying the web traffic or by sniffing the traffic as it goes by. Web page content is recursively parsed into sub-components. Sub-components are fingerprinted with a cyclic redundancy check code or other loss-full compression so that recurrence of a sub-component in subsequent pages can be detected. Those sub-components which persist in the web traffic, as measured by the frequency of recurrence observed by the NCs (6), are defined as having substantive content of interest to data-mining applications. Where a substantive content sub-component is added to or removed from a web page, the change is significant and is sent to a duplication filter (11), so that if multiple NCs (6) detect a change in a web page only one announcement of the changed URL is broadcast to the data-mining applications (8). The NC (6) identifies substantive content sub-components which are repeatably part of a page pointed to by a URL. Provision is also made for limiting monitoring to pages carrying a flag that authorizes discovery of the page by a monitor.

This patent application claims priority from U.S. provisional application 60/255,392 of the same title as the present application, filed on Dec. 15, 2000.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to techniques for systematically locating and monitoring information on the Internet, and in particular to a genre of such techniques known as “web crawlers.”

2. Background Description

Web crawlers are programs used to find, explore, and monitor content on the World Wide Web (WWW). They are the primary method used by most data-mining applications, such as search engines, to discover and monitor WWW content. Due to the distributed nature of the WWW, crawling currently represents the best method for understanding how content on the WWW changes.

The WWW is a large connected graph of HyperText Markup Language (HTML) pages distributed over many computers connected via a network. The pages are connected and accessed by Uniform Resource Locators (URLs). These URLs are addresses of the HTML pages.

A crawler is seeded with a set of URLs. These URLs are placed in a queue. For each of the URLs, the program downloads the page. It then extracts the external URLs referenced on that page before proceeding to the page of the next URL in the queue. Each of the URLs extracted is then added at the end of the queue with the other URLs the crawler was seeded with. This process repeats indefinitely. The URLs collected and queued in this fashion form a WWW graph, wherein each URL is linked to a seed URL, or to another URL on whose page the URL was found, and to those other URLs referenced on the URL's page.
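
As context for the discussion below, the following is a minimal sketch of such a queue-driven crawl loop. The Fetcher and LinkParser helpers are hypothetical stand-ins for an HTTP client and an HTML link extractor, and the de-duplication set is a common practical addition rather than part of the description above.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    public class SimpleCrawler {

        // Hypothetical helpers standing in for an HTTP client and an HTML link parser.
        interface Fetcher { String fetchPage(String url); }
        interface LinkParser { List<String> extractUrls(String html); }

        public static void crawl(List<String> seeds, Fetcher fetcher, LinkParser parser) {
            Queue<String> queue = new ArrayDeque<>(seeds);   // seed URLs start the queue
            Set<String> queued = new HashSet<>(seeds);       // avoid queueing the same URL twice

            while (!queue.isEmpty()) {                       // repeats until the queue drains
                String url = queue.poll();
                String page = fetcher.fetchPage(url);        // download the page
                for (String link : parser.extractUrls(page)) {
                    if (queued.add(link)) {                  // append newly found URLs to the end
                        queue.add(link);
                    }
                }
            }
        }
    }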

The foregoing crawling algorithm describes a breadth-first exploration of the WWW graph. Other methods of exploring content of the WWW may use depth-first searches or hybrid solutions.

The problem with current crawlers is that they have finite resources and can get into infinite loops traversing the changing WWW graph. Following one URL can bring up a page with other URLs, and so on. Because these pages and URLs can be generated dynamically (“dynamic content”) at the time of the request, a crawler can be faced with exploring an infinite graph.

When users or web crawlers request a web page via its URL, the request is sent to a web server responsible for returning the requested HTML page. In the early days of the WWW, these web pages were stored as files on the permanent storage of the web server. The web server was simply a “file server”: there was a one-to-one mapping between a URL and a specific web page. Since those early days, web servers no longer necessarily serve back stored files. Often the file is generated “on the fly” based on a number of parameters (URL with parameter string, cookies, time of day, user information, information in a database, prior history, etc.). These parameters are infinite in their variety and values. Pages created in this manner are commonly referred to as “dynamic content,” as opposed to the early “static content” that consisted simply of unchanging web files.

SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide a method for exploring and monitoring web content by monitoring web traffic, in order to feed data-mining applications in ways analogous to how web crawlers have done so in the past.

A further object of the invention is to explore and monitor dynamic and static content, not just static content, and to identify web pages having new or deleted substantive content to one or a plurality of data-mining applications.

Another object of the invention is to update web content indices using a methodology which is based upon a changing infinite graph model of the web.

It is also an object of the invention to limit announcement of new content to those web pages which contain substantive content changes, ignoring changes in mere HTML code or other non-relevant changes.

Another object of this invention is to discover and monitor substantive content blocks that are common to many web pages.

Yet another object of the invention is to avoid custom, time-consuming integration with web servers in order to access dynamic content, in favor of a universal “black box” solution.

A further object of this invention is to limit the resources required of web servers to service web crawlers.

The present invention provides a solution to these problems by modeling the WWW as an infinite graph, which therefore cannot be explored in the normal way. The graph of the WWW is explored by examining the finite content that is being generated by real users requesting real pages. Under this approach, “New Crawlers” (NCs) proxy the web traffic between users and web servers, or “sniff” this traffic. Proxying, sniffing, or other methods that gain access to web requests and responses shall from this point be called “proxying” to simplify the discussion. The reader will, however, keep in mind the variety of methods available to accomplish this task.

These NCs can proxy this content anywhere in the communications path between the web server and end users. Alternatively, they can sniff the content as it passes by, thereby not interfering with the communication pathway. As content passes by, the crawlers examine the pages. The method of examination parses a page into sub-components. This process is recursive and creates sub-components of sub-components, as will be evident from the appended pseudo-code and the more detailed description of the parsing methods below. These sub-components are uniquely identified with a fingerprint. With the dynamic content methodologies now used at servers to generate responses to a URL request, much of the content of the returned page may be different for each access, but these differences are not significant from the viewpoint of data-mining applications. The present invention filters these insignificant differences out by using the unique fingerprints to identify those sub-components which persist. It does so over time and across multiple and different URLs and HTML pages. When one of these persistent sub-components is added to or deleted from a web page, this change is defined as “substantive” for the purposes of describing the present invention.

The NCs can then send these pages and URL addresses back to the indexing or collection points of a system. Alternatively, they can keep caches of these pages, or keep records of unique fingerprints of these pages, to reduce the number of pages that get sent back, and only send back pages that have received a threshold of accesses. Alternatively, they can break the pages into components, cache the components (or their unique fingerprints) and send back only those components that have received a certain threshold of accesses, irrespective of which actual web pages and URLs generated these components. Once a page or component has been announced to the data-mining systems, the NC acts as a block to stop retransmission of content already announced. Pages and components not resident in these caches are either new or have previously been expired from the cache. A change in any page or component will result in the creation of a new page or new component on which this process will repeat.

Since NCs may not be directly proxying information in front of a designated web server for that content (as is the case if the proxying occurs at the ISP level), there may be multiple NCs “announcing” new content from the same sub-component or page. For example, if one NC resides on an AT&T backbone and another resides on an AOL backbone, both may encounter a web page from cnn.com. Since neither NC knows what the other has seen, both will need to cache and monitor the page. It is therefore a realistic outcome for the cnn.com page to be announced multiple times. To handle this problem of duplicate announcements, the announcements are sent to an intermediary “Duplication Filter” that acts to collapse multiple announcements into one. Therefore, when new content is found separately by multiple NCs, only one announcement reaches the data-mining applications.

The announcement of new, changed, or deleted content by a plurality of NCs to a plurality of data-mining applications can be accomplished a number of ways, for example as further described in co-pending application PCT/US01/14701 to the same assignee and inventors, entitled “Relevant Search Rankings Using High Refresh Rate Distributed Crawlings.” Various forms of multicast technology may be employed. Messages can be sent to a central routing application that has knowledge of the subscribers to the update messages; alternatively, IP-multicast may be used, or overlay networks and group communication systems may be used.

The use of a proxying NC removes a considerable amount of resource load from web servers that would normally have to service requests from conventional web crawlers. First, no actual requests are sent from the NC to the web servers, so crawler-based load is eliminated. Furthermore, because one or more NCs may feed many data-mining applications, multiple sets of independently operated NCs are not necessary, although they may be used for political reasons.

One alternative to using a proxying NC would be to build the change detection into a web site. The advantage that the use of NCs has over this approach is their “Black Box” feature. The NCs work with any web server regardless of how the content is created, whereas integrating into the creation processes of a web site is extremely complex and varies greatly among the millions of sites. Many components are used to generate the content a web server serves (databases, scripts, templates, etc.). Change detection would need to be applied to all of these components in solutions that build the change detection into the web site. Furthermore, a system of NCs maintains the integrity of the knowledge it is providing. If web site owners were responsible for announcing when their pages change, based on some common format, there would be no guarantee that web site owners would actually perform that function. For a variety of reasons they might not follow the common format and announcement rules. By using NCs at many points in the network communications pathways, the NC owners avoid this problem and are able to ensure the accuracy of the updates.

This system of NCs can be incorporated into existing data-mining and indexing methods with very little change. By proxying just in front of the web servers, the data can be more easily broken up based on domain. But the proxying can also be done at caching servers in front of web servers, on Internet Service Providers' networks, or at caches right in front of the end user. By proxying content, the new crawlers can identify which pages are being requested and view the resulting responses. It is to be noted from the definition being used herein that the term “proxying” content requests includes “sniffing” (i.e., receiving a copy of the data stream from) communication pathways. It is also to be noted that “proxying” can be implemented on a sampling of the content requests: full coverage of all requests is not necessary. However, the more content the NCs have to process, the faster they are able to distinguish relevant from non-relevant content.

In those cases where unannounced proxying is not allowed because of legal requirements, a similar method could be employed. Caching, proxy, or sniffing servers would only cache, proxy, or sniff pages that have a special tag embedded in the HTML page. This tag would designate the page as being available for discovery by the NC system. Thus the creators of the pages give implicit permission for the page to be discovered using this method. The tag could also specify an alternative URL address to use to access the page content, in case sensitive user-specific information is included with the page or the URL. This method also has the advantage of giving the creators of web pages the ability to self-request their pages a required number of times through an NC proxy server, which would then discover the page. Extensions to this model of the use of tags can be used to instruct the NC on how to handle tagged content in many ways.
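
Purely as an illustration, such a tag could be carried in an HTML meta element. The tag name "nc-discoverable" and the alternative-URL attribute below are hypothetical (the disclosure does not fix a tag format); the sketch shows how an NC might test a page for permission before caching or announcing it.

    import java.util.Optional;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DiscoveryTagCheck {
        // Hypothetical tag form: <meta name="nc-discoverable" content="yes" data-alt-url="...">
        private static final Pattern TAG = Pattern.compile(
                "<meta\\s+name=\"nc-discoverable\"\\s+content=\"yes\"" +
                "(?:\\s+data-alt-url=\"([^\"]*)\")?\\s*/?>",
                Pattern.CASE_INSENSITIVE);

        // Returns the URL to monitor if the page opted in, or empty if it did not.
        public static Optional<String> discoverableUrl(String html, String requestUrl) {
            Matcher m = TAG.matcher(html);
            if (!m.find()) {
                return Optional.empty();            // no tag: the NC must ignore this page
            }
            String altUrl = m.group(1);             // optional alternative access address
            return Optional.of(altUrl != null ? altUrl : requestUrl);
        }
    }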

The downside to this approach is that content never requested by users or creators will not be identified by the system. Although existing crawlers may never have completely covered the entire WWW graph because of its infinite size, they may have explored possible content that has never been accessed by a user. By using both approaches together, a superior solution can evolve. Additionally, web site creators can manually make normal web requests for their content. This would pass their content through an NC, solving the problem of content that is never requested by normal users.

In one embodiment of the invention, a web crawler for handling static and dynamic content uses a parsing algorithm to recursively parse web pages into sub-components. It then uses a loss-full compression algorithm to assign a unique fingerprint to each of the sub-components parsed from a web page responsive to a URL. The parsing algorithm, the loss-full algorithm, and the respective sub-component fingerprints with their corresponding URLs are then sent to a data-mining application, which is thereby enabled to repeatably locate any of the sub-components.

In another embodiment, a web crawler in accordance with the invention monitors web traffic at a plurality of points between webservers and users, and recursively parses into sub-components the web pages so monitored, the web traffic comprising web pages responsive to URLs. The crawler then assigns a unique fingerprint to each parsed sub-component and keeps a count of the number of times each unique fingerprint recurs.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:

FIG. 1 depicts the proxying of web content by an NC.

FIG. 2 depicts the algorithm used to detect new relevant content pages or sub-components.

FIG. 3 depicts the operation of many NCs working together to feed many data-mining applications through a duplication filter.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION

The prior art model of the web is the root cause of the problem with current crawlers. The prior art model effectively assumes a finite graph, which does not correspond to the reality of the web and is therefore in error. In reality, the graph changes all the time, and the content at the nodes of the graph changes as well. Pages are created, destroyed, and changed. URLs are also created, destroyed, and changed. URL address mapping to web pages is not necessarily one-to-one (1→1). Address mapping can also be many-to-one (N→1) or one-to-many (1→N).

Where the mapping is many-to-one, many unique URLs retrieve the same web page. In a one-to-one mapping, one unique URL retrieves the same web page. These results are acceptable. But a one-to-many mapping means that one URL gets many different web pages, which is not an acceptable result.

When a user makes a request to a web server for a page, the URL for the page is passed to the webserver. Yet this is not the only information that may be used in generating and returning a response. For example, information stored on the user's hard drive can be read and sent along with the URL request. Or the web server can use other information known about the user, or any other inputs needed to generate the page. This means that a request from one user may produce results which are different from the results produced by a request from another user, even when both use the same URL.

Therefore requests and responses can be grouped into three categories:

-   Responses based entirely on the URL
-   Responses based partly on the URL
-   Responses not based at all on the URL

When a URL returns a page which is different from the page returned the last time the URL was requested, this means that one of the following is true:

-   A change has occurred in the content
-   This URL has a one-to-many (1→N) mapping
-   The response is not entirely based on the URL

Identifying which URLs have changed and which have not is fairly standard in prior art methods of web crawling. Every time the page is visited, a loss-full compression fingerprint of the page is made; examples include MD5 checksums, a variety of hashing algorithms, a Cyclic Redundancy Check (CRC) checksum of the page's content, etc. The web page is run through an algorithm that produces a unique identifier of the page. On later visits by the web crawler, that fingerprint will be different if the page content has changed. This idea of tracking which pages have changed is important for indexes and data-mining applications. Most of these applications only want to see what is changing on the WWW, not a continuous river of known information, 99% of which they have already processed.
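
For instance, a CRC-32 checksum or an MD5 digest of the page's bytes can serve as such a fingerprint. The sketch below uses the standard java.util.zip.CRC32 and java.security.MessageDigest classes and is illustrative only.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.zip.CRC32;

    public class PageFingerprint {
        // CRC-32 checksum of the page content: cheap, 32 bits, adequate for change detection.
        public static long crc32(String page) {
            CRC32 crc = new CRC32();
            crc.update(page.getBytes(StandardCharsets.UTF_8));
            return crc.getValue();
        }

        // MD5 digest: 128 bits, far less likely to collide across distinct pages.
        public static byte[] md5(String page) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return md.digest(page.getBytes(StandardCharsets.UTF_8));
        }
    }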

Yet dynamically generated content presents a problem. Thousands of URLs can all respond with slightly different HTML pages that all contain the same news article content. It can be argued that the important thing to watch for is changes in substantive content, not necessarily changes in HTML code. If there is a new news story, or new information, then this is a substantive change in content. A new or changed HTML page that changed the date or font and nothing else, on the other hand, is not relevant for most applications. Therefore, tracking changes in information blocks (sub-components) that may exist on a number of pages helps to alleviate the problem caused by the infinite number of pages that exist.

Referring now to the drawings, and more particularly to FIG. 1, there is shown a schematic of the operation of a New Crawler on the web, serving a data-mining application. A user web browser 7 requests 1 a URL. The New Crawler 6 forwards 2 the URL request to the web server 5 specified by the URL. The web server 5 returns 3 the requested page to the New Crawler 6, which forwards 4a the returned page to the user web browser 7. After processing the returned page, the New Crawler 6 sends 4b new sub-components or pages to the data-mining applications 8.

The processing done by the New Crawler 6 on the returned page 3 is further described with reference to FIG. 2. The invention provides a new way of looking at the returned HTML page 3, by breaking it up 9 into sub-components of content and arrangement. Sub-components make up an HTML page, and many pages can share sub-components. These sub-components can be identified in a number of ways. One of the most obvious ways is to parse the HTML page using its Document Object Model parse tree, with each sub-tree or node (internal and external) being considered a sub-component. An alternative method would be to render the page as a graphical image, break up the image into smaller images, and use these images as sub-components. Still another method would be to render the HTML page to text (possibly using the Unix Lynx web browser) and parse the text into its paragraphs, with each paragraph of text considered a sub-component. The choice of method is inconsequential and depends on the needs of the applications the new crawler is feeding.
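
A sketch of the first method, in which every sub-tree of the page's Document Object Model is treated as a sub-component, might look as follows. The use of the jsoup HTML parser is an assumption made only for this example; any DOM parser would serve.

    import java.util.ArrayList;
    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class DomSubComponents {
        // Every element (i.e. every sub-tree of the DOM) is treated as one sub-component.
        public static List<String> subComponents(String html) {
            Document doc = Jsoup.parse(html);
            List<String> parts = new ArrayList<>();
            for (Element e : doc.getAllElements()) {
                parts.add(e.outerHtml());   // the serialized sub-tree is the sub-component
            }
            return parts;
        }
    }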

By tracking the relative frequency of accesses to these sub-components based on user traffic, the “substantive content” can be discerned from the “non-substantive content”. Furthermore, “new” components announced to data-mining applications are declared “new” only after they have received enough accesses. The definition of “enough” is an algorithm, independent constant, or function that is determined by the owners of the NC system.

For example, suppose a web page is broken up into three components A, B, and C, with C being further broken up into components D and E. The next step 101 in the process is to create a fingerprint for each component. Each of these fingerprints is stored along with a count of the number of accesses. When other pages are broken up and accessed and also contain any of the components A, B, C, D or E, a further step 102 in the process will check the fingerprint against those which have been stored. If there is a match, the access count for the component will be incremented 103. The process will then repeat these steps for the next component, as indicated by step 106. Suppose that components A and D reflect transient dynamic content and components B and E are persistent articles, with C being a composite (for example, D is the current date and E a persistent article). If 10 pages are broken up into components, and components B and E have counts of 10 while components A, C, and D have counts of 1, then one could say that components B and E contain highly accessed or “substantive” information while components A, C, and D contain rarely accessed or “non-substantive” information. A threshold can be established for the access count of a component or sub-component, so that when the access count reaches the threshold at step 105, the component or sub-component is announced as new 104 before the process returns 107 to get the next component. Extensions to this algorithm may incorporate other information into the threshold besides access count. For example, a combination of access count and a classification of topic importance by a classification engine may be used in determining the threshold for announcement.
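
A minimal sketch of the counting and threshold steps 101 through 106 described above is given below; the announce callback is a stand-in for whatever announcement mechanism 104 the NC system uses.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.function.Consumer;

    public class ComponentCounter {
        private final Map<Long, Integer> accessCounts = new HashMap<>(); // fingerprint -> count
        private final Set<Long> announced = new HashSet<>();
        private final int threshold;
        private final Consumer<Long> announce;   // stand-in for announcing to data-mining apps

        public ComponentCounter(int threshold, Consumer<Long> announce) {
            this.threshold = threshold;
            this.announce = announce;
        }

        // Steps 101-103: look up the component's fingerprint and increment its access count.
        // Steps 104-105: once the count reaches the threshold, announce it exactly once.
        public void record(long fingerprint) {
            int count = accessCounts.merge(fingerprint, 1, Integer::sum);
            if (count >= threshold && announced.add(fingerprint)) {
                announce.accept(fingerprint);
            }
        }
    }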

By placing proxies or content sniffers in front of a plurality of web sites, and connecting them through a Duplication Filter as shown in FIG. 3, a system can be built to identify and report on changes to substantive sub-components. Without the Duplication Filter 11, individual NCs will not know what the other NCs have announced, because they work off of separate caches. But by having NCs 6 send announcements of new components 104 to the intermediary Duplication Filter 11, this problem is resolved. The Duplication Filter 11 acts to collapse multiple announcements into one. Communication is performed using standard networking approaches. Unicast or multicast techniques can be used where appropriate. In the preferred embodiment of the invention, network TCP/IP connections are initiated from the NC 6 to the Duplication Filter 11. Messages from the Duplication Filter 11 are sent to data-mining applications either through multicast or an overlay network. Without the presence of the Duplication Filter 11, the NCs 6 would multicast the updates to the data-mining applications directly. Therefore, when new content is found separately by multiple NCs 6, only one announcement reaches the data-mining applications 8. In conjunction with a duplication filter, placing the NCs closer to the web servers will help reduce duplications as well.
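
A sketch of the core of the Duplication Filter 11: it remembers which fingerprints it has already relayed and forwards each distinct one to the data-mining applications only once. The forward callback stands in for the multicast or overlay-network delivery described above.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Consumer;

    public class DuplicationFilter {
        private final Set<Long> seen = ConcurrentHashMap.newKeySet(); // fingerprints already relayed
        private final Consumer<Long> forward; // stand-in for multicast / overlay-network delivery

        public DuplicationFilter(Consumer<Long> forward) {
            this.forward = forward;
        }

        // Called for each announcement arriving from any NC over its TCP/IP connection.
        public void onAnnouncement(long fingerprint) {
            if (seen.add(fingerprint)) {     // true only the first time this fingerprint arrives
                forward.accept(fingerprint); // duplicates from other NCs are collapsed
            }
        }
    }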

The advantage of this method is that sub-components can gain relevance from across many web pages. Thus a news article that appears on every “personalized” page for users can be identified as relevant content, while the “personalization” content (sub-components) of the page, such as the user's name, will be disregarded as non-relevant. For example, a book excerpt from amazon.com will exist in web pages given back to many user requests. However, the request (URL) that generated that page (user name, time of day, other parameters, etc.) may never occur again. Yet the book excerpt is handed out thousands of times a day. In this case the book excerpt sub-component would be announced to the data-mining applications while the other elements unique to the user requests would not be.

Specifically, new crawlers filter the web traffic from web servers. They break the HTML responses up into sub-components and take 128-bit fingerprints of the sub-components to uniquely identify them (ID). They then record the number of hits an ID receives. When a sub-component receives a threshold number of hits, and the crawler can identify a URL that reliably accessed a page with the sub-component, the crawler announces this sub-component and URL as new.

This announcement would be received by any application that was tracking changes to the content on this web server. The testing of reliability would be performed by the NC requesting the page again with the same parameters that were used by one of the pages that incremented the access count for that sub-component. If the page returns again and contains the same sub-component, the sub-component is linked to the request parameters and both are announced to the data-mining applications. This testing of reliability demonstrates that the sub-component is “repeatably” accessible from this URL string. These applications can now use those parameters to gain access to the page if they wish to see the sub-component.
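
A sketch of this repeatability test follows. The Refetcher helper, which replays the original request with its stored parameters, and the parse and fingerprint functions passed in are assumptions standing in for the NC's actual machinery.

    import java.util.List;
    import java.util.function.Function;
    import java.util.function.ToLongFunction;

    public class RepeatabilityTest {
        // Hypothetical helper that replays a request with its original parameters
        // (URL string, cookies, etc.) and returns the HTML response.
        interface Refetcher { String refetch(String urlWithParameters); }

        // Re-request the page; the sub-component is "repeatably" accessible from this
        // URL string only if its fingerprint shows up again among the parsed sub-components.
        public static boolean isRepeatable(String urlWithParameters,
                                           long subComponentFingerprint,
                                           Refetcher refetcher,
                                           Function<String, List<String>> parseSubComponents,
                                           ToLongFunction<String> fingerprint) {
            String page = refetcher.refetch(urlWithParameters);
            for (String part : parseSubComponents.apply(page)) {
                if (fingerprint.applyAsLong(part) == subComponentFingerprint) {
                    return true;
                }
            }
            return false;
        }
    }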

The pseudo-code below represents the algorithm. In addition, there would need to be a mechanism to expire the sub-components in the system after they have not been hit in a long time. This need arises from the fact that the server cannot have infinite memory to store all the new fingerprints of the sub-components, many of which will only be hit once and never again. A simple algorithm to expire components would walk through the data cache and expire components that have not been accessed in a long time (to be defined by available resources). This can be done periodically or continually as a background thread.
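
One possible form of such a background expiration pass, assuming each cached fingerprint records the time of its last access; the idle window is a tuning parameter left to available resources.

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ComponentExpirer {
        // fingerprint -> time of last access, in milliseconds since the epoch
        private final Map<Long, Long> lastAccess = new ConcurrentHashMap<>();
        private final long maxIdleMillis;   // "a long time", tuned to available resources

        public ComponentExpirer(long maxIdleMillis) {
            this.maxIdleMillis = maxIdleMillis;
        }

        public void touch(long fingerprint) {
            lastAccess.put(fingerprint, System.currentTimeMillis());
        }

        // Walk the cache and drop components not accessed within the idle window.
        // Intended to run periodically, or continually from a background thread.
        public void expireNodes() {
            long cutoff = System.currentTimeMillis() - maxIdleMillis;
            Iterator<Map.Entry<Long, Long>> it = lastAccess.entrySet().iterator();
            while (it.hasNext()) {
                if (it.next().getValue() < cutoff) {
                    it.remove();
                }
            }
        }
    }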

The method of repeatability will also be used in the expiration algorithms. Pages may be periodically retested for repeatability, and if they are determined not to be repeatable they are expired.

The method of the present invention is implemented in the following pseudo-code. Note that the sub-sections should be all possible parse-trees, and all possible permutations where a sub-tree is removed. This takes care of the problem where one cares about a main page changing, but not how it changes.

Pseudo Code Algorithm

    // Subsections should be all possible parse-trees, and
    // all possible permutations where a sub-tree is removed.
    // This takes care of the problem where you care about a
    // main page changing, but not how it changes.

    class SubComp {
        boolean is_Root;
        SubComp root_Comp;
        long    finger_Print;
        long    access_count;
        long    last_change;
        HashSet access_URLS;
    }

    void ProcessWebServerReturnPage(Url, Page, responseCode) {
        // only use valid HTML responses
        if ((responseCode < 200) || (responseCode >= 300))
            return;

        // parse the page into sub-sections
        CRC_TREE.create(Page);

        // OPTIMIZATION:
        // If the root component receives enough accesses and can be
        // confirmed to always map to this URL, then remove any other
        // root component that had this URL in its access_URLS set.
        //   - Remove the URL from the other root components'
        //     access_URLS, because this URL has now permanently
        //     shifted to a new root page
        //   - Send removal notices for these other root components
        //     if removals occur
        //   - Delete the root component
        //
        // OPTIMIZATION:
        // To reduce static content redundancy (multiple announcements),
        // only announce non-root components if the set of URLs with
        // this sub-component is greater than its root component's URLs.
        //   - This identifies TRUE dynamic content, not just static
        //     content and static content with multiple access paths.

        SubComp root = CRC_TREE.root();
        while (CRC_TREE.hasNext()) {
            finger_print = HASH(CRC_TREE.next());
            SubComp Comp;
            if (COMP_CACHE does not contain finger_print) {
                Comp = new SubComp(finger_print);
                COMP_CACHE.put(Comp);
            } else {
                Comp = COMP_CACHE.get(finger_print);
            }
            Comp.access_count++;
            Comp.last_change = NOW;
            Comp.root_Comp = root;
            if (!Comp.access_URLS.contains(Url)) {
                if (Url.isRepeatable()) {
                    // request the page again; if it contains this
                    // sub-component, then it is repeatable
                    Comp.access_URLS.add(Url);
                }
            }
            if (Comp was not Announced) {
                if (Comp.access_count == Threshold) {
                    Send_New_Component_Detected(Url, Comp);
                }
            }
        }
    }

    // + Expire Comps that are not accessed often
    // + Expire Comps that are part of static content and
    //   whose root has been announced
    // + Expire Comps with high counts but no Repeatable nodes
    //
    // ExpireNodes will be called periodically by a background thread
    ExpireNodes() {
        // Algorithm dependent on available resources
    }

While the invention has been described in terms of a single preferred embodiment, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.

CLAIMS

1. A method for web crawling that handles static and dynamic content, comprising the steps of: monitoring web traffic at a plurality of points, each said point being between a webserver and a user, said web traffic comprising web pages responsive to URLs; for a plurality of web pages in said web traffic, recursively parsing each said web page into sub-components; assigning a unique fingerprint to each said parsed sub-component; labeling as substantive those said sub-components whose fingerprints recur in monitored web traffic, said recurrence being in excess of a threshold metric; identifying as changed those web pages in said web traffic wherein a substantive sub-component is added or removed; eliminating duplicates in changed web pages identified in said identifying step; and announcing said changed web pages to data-mining applications.

2. The method of claim 1, wherein said monitoring is accomplished by proxying said web traffic.

3. The method of claim 1, wherein said parsing includes using a parse tree of said web page, said web page having tree nodes and each tree node being a sub-component.

4. The method of claim 1, wherein said parsing includes rendering said web page as a graphical image and breaking said image into smaller images, each said smaller image being a sub-component.

5. The method of claim 1, wherein said parsing includes rendering said web page as text and parsing said text into paragraphs, each said paragraph being a sub-component.

6. The method of claim 1, wherein said substantive sub-components are expired after a period of time without recurrence.

7. The method of claim 1, wherein said monitoring is limited to those web pages embedded with a tag designating said page as available for discovery.

8. A method for filtering dynamically generated content from change detection engines serving data-mining applications, comprising the steps of: recursively parsing web pages responsive to URL requests into sub-components, said web pages appearing in web traffic; assigning a unique fingerprint to each said parsed sub-component; labeling as substantive those said sub-components whose fingerprints recur in monitored web traffic, said recurrence being in excess of a threshold metric; identifying as changed those web pages in said web traffic wherein a substantive sub-component is added or removed; and eliminating duplicates in changed web pages identified in said identifying step.

9. The method of claim 8, wherein said identification step includes the further step of determining that said substantive sub-component is repeatably contained in said web page response to a URL request.

10. The method of claim 8, further comprising the step of announcing said changed web pages to data-mining applications.

11. The method of claim 10, wherein said identification step includes the further step of determining that said substantive sub-component is repeatably contained in said web page.

12. A method for web crawling that handles static and dynamic content, comprising the steps of: monitoring web traffic at a plurality of points, each said point being between a webserver and a user, said web traffic comprising web pages responsive to URLs; for a plurality of web pages in said web traffic, recursively parsing each said web page into sub-components; assigning a unique fingerprint to each said parsed sub-component; and keeping a count of recurrence of each said unique fingerprint.

13. The method of claim 12, further comprising the step of determining those said sub-components for which said count is in excess of a threshold number.

14. The method of claim 13, further comprising the steps of: identifying as changed those web pages in said web traffic wherein a substantive sub-component is added or removed; eliminating duplicates in changed web pages identified in said identifying step; and announcing said changed web pages to data-mining applications.

15. A computer program for web crawling that handles static and dynamic content, comprising: a routine for monitoring web traffic at a plurality of points, each said point being between a webserver and a user, said web traffic comprising web pages responsive to URLs; a routine for recursively parsing each said web page into sub-components; a routine for assigning a unique fingerprint to each said parsed sub-component; a routine for labeling as substantive those said sub-components whose fingerprints recur in monitored web traffic, said recurrence being in excess of a threshold metric; a routine for identifying as changed those web pages in said web traffic wherein a substantive sub-component is added or removed; a routine for eliminating duplicates in changed web pages identified in said identifying step; and a routine for announcing said changed web pages to data-mining applications.

16. A method for web crawling that handles static and dynamic content by monitoring web traffic at a plurality of points, each said point being between a webserver and a user, said web traffic comprising web pages responsive to URLs.

17. A method for web crawling that handles static and dynamic content, comprising the steps of: using a parsing algorithm to recursively parse web pages responsive to URL requests into sub-components, said web pages appearing in web traffic; using a loss-full algorithm to assign a unique fingerprint to each said parsed sub-component in each said URL; and sending to a data-mining application said parsing algorithm, said loss-full algorithm, and said sub-component fingerprints correlated to each corresponding URL, wherein said data-mining application is enabled thereby to repeatably locate any of said sub-components.

18. The method of claim 17, further comprising the steps of: labeling as substantive those said sub-components whose fingerprints recur in monitored web traffic, said recurrence being in excess of a threshold metric; identifying as changed those web pages in said web traffic wherein a substantive sub-component is added or removed; and eliminating duplicates in changed web pages identified in said identifying step.

19. The method of claim 18, wherein said threshold metric is an algorithm that uses a count of said recurrence as a parameter.

20. The method of claim 19, wherein said threshold metric is an algorithm that uses at least one additional factor besides said count as a parameter.